TIL6022 - Final Assignment¶

Members: Joris Voogt, Xiaoyu Mao, Yuwen Peng

Student numbers: 4295978, 5947642, 5914507

Contribution Statement¶

Joris Voogt: - Created the choropleth map for Europe to compare passenger km relative to the GDP.

         - Created the bar charts to show the percentage of road and rail passenger km relative to the total passenger km.
         - Created Part IV - Streamlit. 
         - Visualized, modelled, and analysed this data as seen on the [Streamlit application](ttps://til6022-project-l3z3ycqjbpwyyp5vvudwlw.streamlit.app/).

Xiaoyu Mao: background research, country grouping in data processing, optimization and improvement on code

Yuwen Peng: data collection, correlation analysis, visualisation, data visualisation

Research Objective¶

Requires data modeling and quantitative research in Transport, Infrastructure & Logistics

RQ:

  • What is the impact of road investments and maintenance on road safety in European countries?
  • Sub-questions: Do possible factors have different impacts on traffic accidents and casualties in high-income and low-income countries? Is the safety index of high-income countries inversely correlated with road investment? Contrary to low-income countries?
  • And what about the investment in road maintenance? Do other indicators that affect road safety exhibit different impacts?

This research will focus on passenger transport trends and safety issues in Europe. We will first analyse the relationship between transport mode choice and the economy in different countries. The GDP is used to represent the economic development of the country, on the basis of which we will analyse the changes in the scale of road and rail passenger transport in European countries in recent years. By analyzing the volume of rail passenger and road passenger transport , the proportion of both in the overall volume of transport can be told. Then discuss whether there are countries who are highly dependent on road transport, and then we select the countries who have a large proportion of road transport to analyze the relationship between GDP and road safety.

Time scale: Data is a bit limited for passenger transport, so most likely in the range of 2016-2021.

Geographical boundary: Countries in Europe. Not all of them have data available in the datasets, so countries with most data available will be analysed.

Data Pipeline¶

  1. The volume of passenger transport relative to GDP can give the relative scale of passenger transport in the national economy. Visualise it by creating an interactive map of Europe with different colors for each country based on the data, representing levels of the volume of passenger transport relative to GDP for different transport modes.

  2. Analyse the changes in the scale of road and rail passenger transport in European countries in recent years. By analysing the volume of rail passenger transport and road passenger transport in each country and drawing a scale diagram, we can find out the proportion of the two in the overall volume of transport; and drawing a line graph can show the changes in recent years.

  3. Further explore the impact of transport investment on transport safety. Discuss whether there are countries in 2 that are highly dependent on road transport, and then select countries with a large share of road transport to analyse the relationship between GDP and road safety.

  4. Look for other possible factors that could be related to traffic mortality rate. Such as proportions of motorways, GDP per capita, alcohol consumption. This data is likely leading to new relationships and possibly correlates. Therefore, a correlation matrix is used, measuring strength and quickly giving an overview of all the numerical correlations.

  5. Summarize the research and propose ideas about transport patterns in the future about road safety.

In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

Part I - Data Import¶

First step is to create new dataframes, clean the useless columns and rows from the original data, then rename the data columns and rows to be more clearly for the following data:

  • Volume of passenger transport relative to GDP https://ec.europa.eu/eurostat/databrowser/view/TTR00001/default/table?lang=en
  • Rail transport of passengers https://ec.europa.eu/eurostat/databrowser/product/view/TTR00015
  • Road transport of passengers https://ec.europa.eu/eurostat/databrowser/view/ROAD_PA_BUSCOA/default/table?lang=en
  • Road Infrustructure Maintenance Spending https://ec.europa.eu/eurostat/databrowser/view/ROAD_PA_MOV/default/table?lang=en
  • Road injury crashes, fatalities and injuries https://stats.oecd.org/
  • Road fatalities per one hundrend thousand inhabitants https://stats.oecd.org/
  • GDP per capita https://databank.worldbank.org/indicator/NY.GDP.PCAP.CD/1ff4a498/Popular-Indicators
  • other data related to road safety WHO, world bank, etc

Start off with the volume of passenger transport relative to GDP.

  • The indicator about this data is defined as the ratio between the wolume of inland passenger transport measured in passenger-kilometers and GDP.
  • It includes transport on national territory by passenger car, bus, coach and train.
  • It is a key economic indicator that provides insights into the relationship between the volume of passenger transportation servicesand the Gross Domestic Product (GDP) of a country or region.
In [2]:
# Import Volume of passenger transport relative to GDP: Read the csv file as a dataframe
file_path_pg_GDP = 'data/pg_GDP.csv'
df_pg_GDP = pd.read_csv(file_path_pg_GDP)

# Rename columns and rows
df_pg_GDP.columns = df_pg_GDP.iloc[6].tolist()
df_pg_GDP = df_pg_GDP.rename(columns={df_pg_GDP.columns[0]: 'GEO'})

# Clean the data by dropping columns of the dataframe that will not be used
# These rows are used to indicate the data, and columns are used to explain the accuracy of the data
df_pg_GDP = df_pg_GDP.iloc[8:48]
columns_to_remove = [2,4,6,8,10,12,14,16,18,20,22,24]
df_pg_GDP = df_pg_GDP.drop(df_pg_GDP.columns[columns_to_remove], axis=1 )
columns_to_remove = [7]
df_pg_GDP = df_pg_GDP.drop(df_pg_GDP.columns[columns_to_remove], axis=1 )

# Delete the rows without data and use Euro area or country as the row name
df_pg_GDP = df_pg_GDP.set_index('GEO')
df_pg_GDP = df_pg_GDP[~(df_pg_GDP == ':').all(axis=1)]

df_pg_GDP.head() # disaplay the first 5 rows of dataframe
Out[2]:
2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
GEO
European Union - 27 countries (from 2020) 103.7 102.2 100.9 100.8 100.1 100 100 98.3 96.6 95.4 79.2 80.2
Belgium 114.1 110.5 114 111.5 103.7 100 97.4 96.6 95.5 93.8 80.9 79.9
Bulgaria 91.5 91.2 91.9 94.4 99.1 100 95.9 93.9 89.7 88.7 75.4 72.3
Czechia 101.2 100 99.6 100.7 103.1 100 100.7 99.2 100.9 101.5 83.8 99.2
Denmark 98.7 99.5 98.7 97.7 98.1 100 99.8 98.6 97.2 98.2 88.4 85.9

The next is the rail and road passengers of each country.

  • They are two different ways of transportation that reflect different travel modes in different countries.
  • Analysis of rail and road passenger transport can provide insights into urban and regional development.
  • For the subsequent exploration of road safety, we can focus on countries with a large proportion of road passenger transportation.
In [3]:
# Import Rail transport of passengers: Read the csv file as a dataframe
file_path_rail_pg = 'data/rail_pg.csv'
df_rail_pg = pd.read_csv(file_path_rail_pg)

# Rename columns and rows
df_rail_pg.columns = df_rail_pg.iloc[7].tolist()
df_rail_pg = df_rail_pg.rename(columns={df_rail_pg.columns[0]: 'GEO'})

# Clean the data by dropping columns of the dataframe that will not be used
df_rail_pg = df_rail_pg.iloc[9:49]
columns_to_remove = [2,4,6,8,10,12,14,16,18,20,22,24]
df_rail_pg = df_rail_pg.drop( df_rail_pg.columns[columns_to_remove], axis=1 )

# Delete the rows without data and reset the index
df_rail_pg = df_rail_pg[~(df_rail_pg == ':').all(axis=1)]
df_rail_pg = df_rail_pg.reset_index(drop = True)

df_rail_pg.head()
Out[3]:
GEO 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
0 European Union - 27 countries (from 2020) : : : : 375,713 384,335 394,142 400,427 413,923 223,668 260,715 :
1 Euro area – 20 countries (from 2023) : : : : : : : : : : : :
2 Euro area - 19 countries (2015-2022) : : : : : : : : : : : :
3 Belgium 10,498 : : : : : : : : : : :
4 Bulgaria 2,059 1,870 1,821 1,698 1,549 1,455 1,434 1,476 1,520 1,118 1,203 1,600
In [4]:
# Import Rail transport of passengers: Read the csv file as a dataframe
file_path_road_pg = 'data/road_pg.csv'
df_road_pg = pd.read_csv(file_path_road_pg)

# Rename columns and rows
df_road_pg.columns = df_road_pg.iloc[7].tolist()
df_road_pg = df_road_pg.rename(columns={df_road_pg.columns[0]: 'GEO'})

# Clean the data by dropping columns of the dataframe that will not be used
# After observation, it was found that they are all sequence numbers with double digits
# Remove double numbered columns starting from the fourth column with index 3
df_road_pg = df_road_pg.iloc[9:37]
columns_to_keep = df_road_pg.columns[:1].to_list()
columns_to_remove = df_road_pg.columns[1::2]
df_road_pg = df_road_pg[columns_to_keep + columns_to_remove.to_list()]

# Delete the rows without data and reset the index
df_road_pg = df_road_pg[~(df_road_pg == ':').all(axis=1)]
df_road_pg = df_road_pg.reset_index(drop = True)

df_road_pg.head()
Out[4]:
GEO 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
0 Belgium 135,070 132,615 : : : : : : : :
1 Bulgaria 8,949.9 8,666.600 9,919.700 10,017.800 9,548.600 8,916.900 8,017.500 8,952.700 4,803 4,611.600
2 Czechia 78,719 79,382 : 84,829 87,257 90,049 94,193 97,320 77,771 96,722
3 Denmark : 66,794 68,140 70,962 73,393 74,468 74,934 : 69,742 70,835
4 Germany 990,582 998,566 1,013,748 1,027,500 1,048,003 1,015,350 1,013,700 996,500 866,646 846,687

Next is annual road investment and maintenace of each country. The data cover the amount of money spent annually measured in euros.Later on, we will focus on analyzing their relationship with road safety.

In [59]:
# Import Road Infrustructure Investment Spending: Read the csv file as a dataframe
file_path_road_inv = 'data/road_inv.csv'
df_road_inv = pd.read_csv(file_path_road_inv)

# First clean the data by dropping columns of the dataframe that will not be used
df_road_inv = df_road_inv.iloc[4:56]
# Remove double numbered columns starting from the fourth column with index 3
columns_to_keep = df_road_inv.columns[:1].to_list()
columns_to_remove = df_road_inv.columns[3::2]
df_road_inv = df_road_inv[columns_to_keep + columns_to_remove.to_list()]

# Rename columns and rows
new_column_names = [str(year) for year in range(1999, 2022)]
df_road_inv.columns = new_column_names
df_road_inv = df_road_inv.rename(columns={df_road_inv.columns[0]: 'GEO'})

# Delete the rows without data and reset the index
df_road_inv = df_road_inv[~(df_road_inv == '..').all(axis=1)]
df_road_inv = df_road_inv.reset_index(drop = True)

df_road_inv.head()
Out[59]:
GEO 2000 2001 2002 2003 2004 2005 2006 2007 2008 ... 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
0 Albania 108,347,272 107,567,894 64,458,009 60,782,802 91,078,262 68,193,896 175,223,508 253,261,449 499,989,410 ... 180,820,940 234,163,844 192,718,553 179,236,201 89,140,344 159,032,880 169,252,490 175,879,111 223,292,017 265,726,260
1 Australia 3,985,384,644 3,232,914,774 3,174,141,910 3,809,467,524 4,673,112,063 6,047,684,590 6,696,458,225 7,813,640,359 7,904,953,310 ... 15,649,722,083 13,990,189,647 10,988,504,646 10,310,609,857 10,841,616,558 12,673,185,564 14,363,912,675 13,317,666,734 12,671,465,384 13,286,823,655
2 Austria 475,000,000 640,000,000 532,000,000 650,000,000 720,000,000 687,000,000 802,000,000 870,000,000 875,000,000 ... 327,000,000 363,000,000 453,000,000 455,000,000 444,000,000 515,000,000 463,000,000 562,000,000 548,000,000 480,000,000
3 Azerbaijan 30,255,355 47,961,631 46,789,989 34,212,659 48,289,409 82,349,945 260,388,800 374,213,302 1,328,480,437 ... 1,479,179,001 1,913,627,639 1,411,301,928 873,204,688 498,158,118 557,071,513 695,011,451 844,323,018 757,138,439 1,208,250,497
4 Belarus .. .. .. .. .. 537,113,018 601,559,599 781,101,292 933,142,311 ... 1,581,319,399 1,446,139,503 1,357,658,853 1,007,813,385 872,967,572 960,754,682 1,004,655,803 1,157,793,622 .. ..

5 rows × 23 columns

In [60]:
# Import Road Infrustructure Maintenance Spending: Read the csv file as a dataframe
file_path_road_maint = 'data/road_maint.csv'
df_road_maint = pd.read_csv(file_path_road_maint)

# Rename columns and rows
df_road_maint.columns = df_road_maint.iloc[2].tolist()
df_road_maint = df_road_maint.rename(columns={df_road_maint.columns[0]: 'GEO'})

# Clean the data by dropping columns of the unit that is meaningless when the selected country are all from Europe
# Then remove the rows from original data for explaination
df_road_maint = df_road_maint.iloc[4:46]
columns_remove = [1]
df_road_maint = df_road_maint.drop( df_road_maint.columns[columns_remove], axis=1 )

# Delete the rows without data and reset the index
df_road_maint = df_road_maint[~(df_road_maint == '..').all(axis=1)]
df_road_maint = df_road_maint.reset_index(drop = True)

df_road_maint.head()
Out[60]:
GEO 2000 2001 2002 2003 2004 2005 2006 2007 2008 ... 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
0 Albania 3,838,572 5,705,714 6,824,432 5,767,021 6,829,499 7,110,073 5,670,979 6,046,008 8,056,200 ... 6,652,853 8,745,595 15,262,572 8,374,520 13,004,791 13,593,700 13,444,559 13,050,669 9,058,683 21,515,139
1 Armenia .. .. .. .. .. 8,770,105 9,588,976 10,683,530 11,111,985 ... 10,706,482 10,085,012 10,050,491 11,236,616 11,586,347 .. .. .. .. ..
2 Austria 508,000,000 520,000,000 294,000,000 388,000,000 458,000,000 443,000,000 495,000,000 486,000,000 467,000,000 ... 517,000,000 559,000,000 667,000,000 692,000,000 697,000,000 687,000,000 726,000,000 752,000,000 791,000,000 899,000,000
3 Azerbaijan .. 15,587,530 15,233,950 17,106,329 34,375,512 33,109,772 54,396,290 31,467,937 34,742,328 ... 34,537,201 31,669,866 31,660,750 22,909,507 18,702,182 27,802,090 25,390,819 32,049,598 52,056,489 46,719,682
4 Belgium 59,000,000 62,000,000 68,000,000 73,000,000 77,000,000 80,000,000 104,000,000 94,000,000 102,000,000 ... 145,000,000 147,000,000 206,000,000 457,000,000 528,000,000 396,797,000 216,000,000 213,000,000 155,000,000 498,000,000

5 rows × 23 columns

Then there is the inport of annual fatalities and casualties of each country.

  • They can two indicators for measuring road safety. And they provide information on different aspects of road traffic accidents.
  • The first one is road fatalities per one hundrend thousand inhabitants. The second is road injury crashes, fatalities and injuries.
In [55]:
# Import Road injury crashes, fatalities and injuries: Read the csv file as a dataframe
df_inj_kill = pd.read_csv('data/inj_kill.csv')

# First clean the data by dropping columns of the dataframe that will not be used
df_inj_kill = df_inj_kill.iloc[4:60]
# Remove double numbered columns starting from the fourth column with index 3
columns_to_keep = df_inj_kill.columns[:1].to_list()
columns_to_remove = df_inj_kill.columns[3::2]
df_inj_kill = df_inj_kill[columns_to_keep + columns_to_remove.to_list()]

# Rename columns and rows
new_column_names = [str(year) for year in range(1999, 2022)]
df_inj_kill.columns = new_column_names
df_inj_kill = df_inj_kill.rename(columns={df_inj_kill.columns[0]: 'GEO'})
# Reset the index
df_inj_kill = df_inj_kill.reset_index(drop = True)

df_inj_kill.head()
Out[55]:
GEO 2000 2001 2002 2003 2004 2005 2006 2007 2008 ... 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
0 Albania 620 547 478 510 1,110 1,180 1,342 1,728 1,554 ... 2,569 2,798 2,617 2,692 2,779 2,611 2,291 2,044 1,598 1,860
1 Argentina .. .. .. .. .. .. .. .. .. ... 122,062 118,925 85,984 96,564 107,852 118,593 109,594 112,065 .. ..
2 Armenia 1,377 1,495 1,448 1,546 1,751 2,084 2,421 3,091 3,532 ... 4,050 4,310 4,776 5,084 4,718 5,458 5,234 5,103 .. ..
3 Australia 28,591 29,066 29,483 29,891 30,330 32,069 33,740 34,005 34,839 ... 35,367 36,322 36,703 38,286 40,237 40,553 40,732 40,649 .. ..
4 Austria 55,905 57,223 57,640 57,812 56,735 54,002 52,660 53,902 51,200 ... 51,426 48,499 48,100 47,845 48,825 47,672 46,934 45,556 38,074 41,251

5 rows × 23 columns

In [77]:
# Import Road fatalities per one hundrend thousand inhabitants: Read the csv file as a dataframe
file_path_fat_inh = 'data/fat_inh.csv'
df_fat_inh = pd.read_csv(file_path_fat_inh)

# First clean the data by dropping columns of the dataframe that will not be used
# These rows are used to indicate the data, and columns are used to explain the accuracy of the data
df_fat_inh = df_fat_inh.iloc[3:62]

# Remove double numbered columns starting from the fourth column with index 3
columns_to_keep = df_fat_inh.columns[:1].to_list()
columns_to_remove = df_fat_inh.columns[2::2]
df_fat_inh = df_fat_inh[columns_to_keep + columns_to_remove.to_list()]

# Rename columns and rows
new_column_names = [str(year) for year in range(1999, 2022)]
df_fat_inh.columns = new_column_names
df_fat_inh = df_fat_inh.rename(columns={df_fat_inh.columns[0]: 'GEO'})
# Reset the index
df_fat_inh = df_fat_inh.reset_index(drop = True)

df_fat_inh.head()
Out[77]:
GEO 2000 2001 2002 2003 2004 2005 2006 2007 2008 ... 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
0 Albania 9.1 9.7 8.2 8.6 10.4 10.2 9.3 12.9 10.3 ... 11.5 10.2 9.1 9.4 9.4 7.7 7.4 8.0 6.4 7
1 Argentina .. .. .. .. .. .. .. .. 14.3 ... 12.2 12.3 12.4 12.6 12.8 12.3 12.4 12.5 .. ..
2 Armenia 6.8 7.6 7.6 8.2 8.4 10.2 11 12.3 13.6 ... 10.7 10.9 10.3 12.0 9.3 9.8 9.9 9.1 .. ..
3 Australia 8.6 8.2 7.8 7.3 7.2 7.3 7.1 7 6.2 ... 5.6 5.5 4.9 5.1 5.3 5.0 4.5 4.7 4.3 4.3
4 Austria 12.2 11.9 11.8 11.5 10.7 9.3 8.8 8.3 8.2 ... 6.3 5.4 5.0 5.5 4.9 4.7 4.6 4.7 3.9 4

5 rows × 23 columns

  • Add more data to analyse the factors that possibly related to safty Alcohol consumption from WHO only collected from 2010 to 2019, so the other data also need to be capture to this period.
In [9]:
# Import GDP per capita: Read the csv file as a dataframe
file_path_GDP = 'data/GDP.csv'
df_GDP = pd.read_csv(file_path_GDP)

# Remove the last row and use European region or country as the row name
df_GDP = df_GDP.iloc[:58]
df_GDP = df_GDP.rename(columns={df_GDP.columns[0]: 'GEO'})
#df_GDP = df_GDP.set_index('GEO')

df_GDP.head()
Out[9]:
GEO 2000 2001 2002 2003 2004 2005 2006 2007 2008 ... 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
0 Albania 1,126.70 1,281.70 1,425.10 1,846.10 2,373.60 2,673.80 2,972.70 3,595.00 4,370.50 ... 4,413.10 4,578.60 3,952.80 4,124.10 4,531.00 5,287.70 5,396.20 5,343.00 6,377.20 6,802.80
1 Andorra 21,620.50 22,809.10 24,785.00 31,954.20 37,630.80 39,583.90 43,084.30 50,562.90 53,721.40 ... 44,747.80 45,680.50 38,885.50 39,931.20 40,632.20 42,904.80 41,328.60 37,207.20 42,072.30 41,992.80
2 Armenia 603.3 676.2 765.3 910.2 1,166.60 1,608.20 2,109.50 3,064.30 3,908.90 ... 3,833.20 4,017.20 3,666.10 3,680.00 4,042.00 4,391.90 4,828.50 4,505.90 4,966.50 7,014.20
3 Austria 24,625.60 24,558.80 26,527.60 32,294.00 36,889.20 38,417.50 40,669.30 46,915.30 51,920.00 ... 50,731.10 51,786.40 44,195.80 45,307.60 47,429.20 51,466.60 50,070.40 48,809.20 53,637.70 52,131.40
4 Azerbaijan 655.1 703.7 763.1 883.7 1,045.00 1,578.40 2,473.00 3,851.30 5,574.50 ... 7,875.30 7,890.80 5,500.50 3,880.70 4,147.20 4,739.80 4,805.80 4,229.90 5,408.00 7,736.70

5 rows × 24 columns

  • Add more data to analyse the factors that possibly related to safety Alcohol consumption from WHO only collected from 2010 to 2019, so the other data also need to be capture to this period.
In [119]:
# Import all the European countries names
countries_iso = pd.read_csv('data/countries_iso.csv')
countries = countries_iso.iloc[:, 1].tolist()

# Import Proportion of motorways over the total road network(%),2010-2021
df_mot_rate = pd.read_csv('data/mot_rate.csv')
df_mot_rate = df_mot_rate.iloc[:,:11]
df_mot_rate.head()

# Import Motorization index(in cars per 1000 inhabitants),2000-2021
df_mot_index = pd.read_csv('data/mot_index.csv')
df_mot_index = df_mot_index.iloc[:,0:1].join(df_mot_index.iloc[:,11:21], how='outer')
df_mot_index.head()

# Import Unemployment rate(%),2010-2022
df_unempl_rate = pd.read_csv('data/unempl_rate.csv')
df_unempl_rate = df_unempl_rate.iloc[:,:11]
df_unempl_rate.head()

# Import alcohol consumption(%),2010-2021
df_alcohol = pd.read_csv('data/alcohol.csv')
df_alcohol = df_alcohol.iloc[:,0:1].join(df_alcohol.iloc[:,11:], how='outer')
df_alcohol.head()

# Reselect the range of other five data sources for casualties
fat_inh = df_fat_inh.iloc[:,0:1].join(df_fat_inh.iloc[:,11:21], how='outer').replace('..',np.nan)
inj_kill = df_inj_kill.iloc[:,0:1].join(df_inj_kill.iloc[:,11:21], how='outer').replace('..',np.nan)

road_inv = df_road_inv.iloc[:,0:1].join(df_road_inv.iloc[:,11:21], how='outer').replace('..',np.nan)
road_maint = df_road_maint.iloc[:,0:1].join(df_road_maint.iloc[:,11:21], how='outer').replace('..',np.nan)

GDP = df_GDP.iloc[:,0:1].join(df_GDP.iloc[:,11:21], how='outer').replace('..',np.nan)

df_alcohol.head()
Out[119]:
GEO 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 Albania 4.88 5.03 4.43 4.28 4.40 4.33 4.38 4.39 4.44 4.40
1 Andorra 10.53 10.46 10.33 10.06 9.88 11.02 11.18 11.18 10.72 10.99
2 Armenia 4.23 4.07 3.89 3.92 4.22 4.04 3.83 3.83 3.69 3.77
3 Austria 12.10 11.90 12.10 12.10 12.40 11.60 11.70 11.60 11.60 11.90
4 Azerbaijan 0.99 0.60 0.52 0.52 0.36 0.51 0.39 0.44 0.88 1.38

Part II - Data processing¶

  • This part is preparation for the subsequent quantitative analysis and data visualization.

Divide countries based upon GDP:

In [11]:
# Divided the coutries into two groups: low-income and high-income countries, basing on GDP per captia
# Select the last five years as evaluation criteria
selected_years = ['2018','2019','2020','2021','2022']
data_selected_years = df_GDP[selected_years]

# Replace invalid values in the dataframe with NAN
data_selected_years = data_selected_years.replace('..', np.nan)
# Convert numerical values to floats
data_selected_years = data_selected_years.replace(',', '', regex=True).astype(float)

# Calculate the average GDP of each country
data_average_GDP = data_selected_years.mean(axis=1)
df_GDP['average GDP'] = data_average_GDP

# Calculate the average GDP of all countries
average_GDP = df_GDP['average GDP'].mean()
print(f'The average GDP of all the countries in last five years is {average_GDP}')
# Add a new column to mark their level of income: high or low
df_GDP['Income Group'] = ''
df_GDP.loc[df_GDP['average GDP'] > average_GDP,'Income Group'] = 'High Income'
df_GDP.loc[df_GDP['average GDP'] < average_GDP,'Income Group'] = 'Low Income'

# Create lists to devide the two groups of countries
High_inc = df_GDP[df_GDP['Income Group'] == 'High Income']['GEO'].to_list()
Low_inc = df_GDP[df_GDP['Income Group'] == 'Low Income']['GEO'].to_list()

# Print two lists of countries one by one
if True:
    print('\033[1mHigh Income countries includes:\033[0m')
    for country in High_inc:
        print(country, end = ',')
    print('\n\033[1mLow Income countries includes:\033[0m')
    for country in Low_inc:
        print(country, end = ',')
The average GDP of all the countries in last five years is 36946.38868421052
High Income countries includes:
Andorra,Austria,Belgium,Channel Islands,Denmark,Faroe Islands,Finland,France,Germany,Greenland,Iceland,Ireland,Isle of Man,Liechtenstein,Luxembourg,Monaco,Netherlands,Norway,San Marino,Sweden,Switzerland,United Kingdom,
Low Income countries includes:
Albania,Armenia,Azerbaijan,Belarus,Bosnia and Herzegovina,Bulgaria,Croatia,Cyprus,Czechia,Estonia,Georgia,Greece,Hungary,Italy,Kazakhstan,Kosovo,Kyrgyz Republic,Latvia,Lithuania,Moldova,Montenegro,North Macedonia,Poland,Portugal,Romania,Russian Federation,Serbia,Slovak Republic,Slovenia,Spain,Tajikistan,Turkiye,Turkmenistan,Ukraine,Uzbekistan,
In [13]:
# The years we choose to analyze (based upon availability)
chosen_years = [str(i) for i in range(2013, 2022)]
# Keep the country column
chosen_years.insert(0, 'GEO')

# The countries we choose to analyze (based upon availability)
chosen_countries = ('Austria','Bulgaria', 'Denmark', 'Germany', 'France', 'Finland','Czechia',  'Croatia', 'Italy', 'Lithuania',  'Poland', 'Slovakia')

# Keep only the chosen years and countries in the road and rail passenger km data sets
df_road = df_road_pg[df_road_pg['GEO'].isin(chosen_countries)][chosen_years]
df_road.reset_index(drop=True, inplace=True)
df_rail = df_rail_pg[df_rail_pg['GEO'].isin(chosen_countries)][chosen_years]
df_rail.reset_index(drop=True, inplace=True)

# Combine road and rail data
df_psk = pd.DataFrame(columns=['Year', 'Country', 'Rail_psk', 'Road_psk', 'Total_psk', 'Percentage road', 'Percentage rail'])

for row in range(0, len(df_road.index)):  
    for col in range(1, len(df_road.columns)):
        
        # Remove commas and change : to NaN
        rail_val = float(df_rail.iloc[row, col].replace(',', '').replace(':', 'NaN'))
        road_val = float(df_road.iloc[row, col].replace(',', '').replace(':', 'NaN'))
        
        df_psk.loc[len(df_psk)] = [df_road.columns[col], 
                                   df_road.iloc[row, 0],
                                   rail_val,
                                   road_val,
                                   rail_val+road_val,
                                   road_val/(rail_val+road_val)*100,
                                   rail_val/(rail_val+road_val)*100
                                   ]

# Show dataframe
df_psk
Out[13]:
Year Country Rail_psk Road_psk Total_psk Percentage road Percentage rail
0 2013 Bulgaria 1821.0 8666.6 10487.6 82.636638 17.363362
1 2014 Bulgaria 1698.0 9919.7 11617.7 85.384370 14.615630
2 2015 Bulgaria 1549.0 10017.8 11566.8 86.608224 13.391776
3 2016 Bulgaria 1455.0 9548.6 11003.6 86.777055 13.222945
4 2017 Bulgaria 1434.0 8916.9 10350.9 86.146132 13.853868
... ... ... ... ... ... ... ...
103 2017 Finland 4271.0 76002.0 80273.0 94.679407 5.320593
104 2018 Finland 4535.0 75961.0 80496.0 94.366180 5.633820
105 2019 Finland 4924.0 75861.0 80785.0 93.904809 6.095191
106 2020 Finland 2820.0 72061.0 74881.0 96.234025 3.765975
107 2021 Finland 2903.0 70760.0 73663.0 96.059080 3.940920

108 rows × 7 columns

Part III - Data Visualization¶

Eurostat to Map data¶

  • This next block of code transforms the volume of passengers relative to GDP data to a dataframe which can be used in a choropleth map. The color variations representing different levels of the data.
  • A gradient color spectrum can be used so that countries with high ratios are shown in darker colors and countries with low ratios are shown in lighter colors.
  • Add a timeline at the bottom or side of the map to represent the time range. Users can drag the timeline or click on different years to see the Volume of passenger transport relative to GDP for each country at different points in time.
  • Parts of this can be reused for other eurostat data sets.

Choropleth Map using Plotly¶

This next block of code creates a map of Europe using plotly's choropleth. In this particular instance, it plots transport volumes relative to GDP. Most of this code can be reused for plotting other data sets using choropleth.

In [12]:
# Read in prepared data set, rename a column and extract the different years
df = pd.read_csv('data/pg_GDP_map.csv')
df = df.rename(columns={'pg_GDP': 'index'})
years = df.year.unique()

# Calculate max value for continuous colour range
max_pg_gdp = df['index'].max()

# Country names to be used in the labeling
customdata = df['country']

# Create choropleth figure
fig = px.choropleth(df,
                    locations='iso_alpha',  # Iso-alpha-3 codes to signify which country
                    color='index',
                    hover_name='country',
                    color_continuous_scale=[[0, 'gray'], [0.01, 'gray'], [0.01, 'blue'], [1, 'red']],  # Gray represents countries with no available data.
                    projection='miller',
                    range_color=(0,max_pg_gdp),
                    scope='europe',
                    animation_frame='year',
                    animation_group='country'
                    )

# Sets hover data
fig.update_traces(customdata=customdata, 
                  hovertemplate= np.select([df['index'] == 0], ['<b>Country: </b> %{customdata}<br><br>No data available'], '<b>Country: </b>%{customdata}<br><br><b>Index value: </b>%{z}'))

# Gets value for the entire EU
numb = df.query('country=="EU"')['index']

# Get current year
current_year = fig.layout.sliders[0].steps[0].label

# Set title and EU index and legend position
fig.update_layout(title='<b>Volume of passengers relative to GDP in ' + current_year + ' (2015 base year)</b><br>EU index: ' + str(numb.tolist()[0]),
                   coloraxis_colorbar=dict(x=0,
                                           y=0.5)
                  )

# Set figure size
fig.update_layout(width=1000,
                  height=600)

fig.show()

A high volume of passenger transport relative to GDP often indicates a robust and growing economy. When the volume of passenger transport is significant in comparison to the GDP, it suggests that the transportation sector is making a substantial contribution to economic growth.
A high ratio may suggest that a country's economy relies heavily on transportation services, such as tourism, logistics, or commuting. In this case, fluctuations in transportation services may have a substantial impact on the overall economic health of the country.
A low ratio relative to GDP might suggest that a country or region's transportation system is highly efficient. In such cases, the transportation sector doesn't need to consume a significant portion of the economic output to move people efficiently, indicating well-developed and cost-effective transport infrastructure.

  • The data on the "Volume of passenger transport relative to GDP" in the European Union from 2010 to 2022, with 2015 defined as 100, can be analyzed as follows:

1. Positive Trend: From 2010 to 2015, the ratio increased from 103.7 to 100, where 2015 was defined as 100. This indicates a relative increase in the volume of passenger transport in relation to GDP during this period. The passenger transport sector was playing an increasingly significant role in the European economy. This have been due to factors such as increased tourism, or infrastructure development and investment.

2. Stability: From 2016 to 2018, the ratio remained relatively stable, hovering around 100. This suggests that the passenger transport sector was maintaining its level of contribution to the economy, with no significant shifts in either direction.

3. Decline: From 2019 onwards, there was a notable decline. By 2022, the ratio had decreased to 80.2. This decline indicates a reduction in the relative contribution of the passenger transport sector to the GDP. This reduction could be influenced by various factors, such as changes in travel behavior (e.g.due to the COVID-19 pandemic), economic shifts, or increased efficiency in the transportation sector. No data augmentation after the pandemicmay because many people adapted to remote work and virtual meetings. As more people work from home, the need for daily commuting and business travel may decrease, impacting the passenger transport sector. Many businesses in the travel and tourism sectors suffered financial losses.Reduced business travel and tourism can lead to lower passenger transport relative to GDP.

Percentage road and rail passenger kilometres¶

Next, we look at the percentage of total passenger kilometres that is road (buses and coaches) and rail kilometres. We do this by plotting the dataframe created in part II as a bar chart with animation frames for the years.

In [14]:
# Remove Denmark as it has no entries for road_psk
df_psk_bar = df_psk.drop(df_psk[df_psk['Country'] == 'Denmark'].index)

# Create bar chart with animation frames for years
fig_bar = px.bar(df_psk_bar,
                 x='Country',
                 y=['Percentage road', 'Percentage rail'],
                 animation_frame='Year',
                 animation_group='Country'
                 )

# Set bar chart layout
fig_bar.update_layout(title='Percentage road and rail passenger km in millions',
                      yaxis_title='Percentage',
                      legend_title_text='',
                      )

# Set hovertemplate
fig_bar.update_traces(hovertemplate='<b>Country:</b> %{x}<br>'
                                    '<b>Percentage:</b> %{y:,.1f}'
                                    '<extra></extra>')

# Show bar chart
fig_bar.show()

To more easily compare the road and rail passenger kilometres for a singular country, we create a bar chart for a user supplied country below:

In [15]:
# Get the countries which are available and ask user for one of these
available_countries = ', '.join(df_psk_bar['Country'].unique()) 
chosen_country = input('Choose a country from: ' + available_countries)

# Filter dataframe on chosen country
df_psk_bar_country = df_psk_bar[df_psk_bar['Country'] == chosen_country]

# Create bar chart with animation frames for years
fig_bar = px.bar(df_psk_bar_country,
                 x='Year',
                 y=['Percentage road', 'Percentage rail'],
                 )

# Set bar chart layout
fig_bar.update_layout(title='Percentage road and rail passenger km in millions in ' + chosen_country,
                      yaxis_title='Percentage',
                      legend_title_text='',
                      )

# Set hovertemplate
fig_bar.update_traces(hovertemplate='<b>Year:</b> %{x}<br>'
                                    '<b>Percentage:</b> %{y:,.1f}'
                                    '<extra></extra>')

# Show bar chart
fig_bar.show()
Choose a country from: Bulgaria, Czechia, Germany, France, Croatia, Italy, Lithuania, Austria, Poland, Slovakia, FinlandFrance

From the above output images, the overall conclusion can be summarized as follows:

  1. High-income countries
  • Austria: The road percentage remained stable from 2013 to 2017 between 88.5% and 88.8%.
  • Germany: The road percentage remained stable at 91.7% from 2013-2016, with a slight decrease from 2017 to a minimum value of 90.9% in 2019, grew to 93.7% and remaining stable from 2020.
  • France: The road percentage stayed above 90% until 2018, then decreased in 2019 to a minimum of 89.8%, and increased again to 92.3% in 2020, with a slight decrease in 2021 to 91.0%.
  • Finland: The road percentage fluctuated between 94.4% and 95.0% from 2013-2017, reached a low of 93.9% in 2019 and started to increase to a high of 96.2% in 2020.
  1. Low-income countries
  • Bulgaria: The road percentage was an increasing trend from 2013 to 2016, increased from 82.6% until 2016 when it reaches a maximum value of 86.8%, then fluctuated around 85% from 2017-2019, and then decreased all the way down to 79.3% in 2021.
  • Czechia: The road percentage stayed above 91.3% until 2015, started to decrease in 2016 reaching a low of 90.0% in 2019, then started to increase in 2020 reaching a high of 93.5% in 2021.
  • Crotia: The road percentage stayed at 97.1% from 2013 to 2015, then fluctuated between 97.4% to 97.7% from 2016 to 2019, with an increase to 98.1% in 2020 and stayed the same thereafter.
  • Italy: The road percentage fluctuated between 94.1% to 94.4% from 2013 to 2017, then decreased somewhat to a minimum of 90.0% in 2018, increased to a maximum of 96.3% in 2020, with a slight decrease in 2021.
  • Lithuania: The road percentage stayed remained above 98.9% with a maximum of 99.2%.
  • Slovakia:The road percentage is the lowest one of all countries.The road percentage was maintained at 67.2% from 2013 to 2014, then has been decreasing until 2019 whrn it reached a minimum of 56.8%,then it started to increase in 2020 and reached a maximum of 63.1% in 2021.

Analysis:

Analyzing the above trends for each country, we observe the following trends and changes:

  1. Large economies such as Germany and France show a high percentage of road travel, probably due to their highly developed road network and vehicle penetration. Meanwhile, the Nordic country Finland, shows a relatively higher percentage of road trips, which may be related to its dispersed population and people's preference for road travel. But Slovakia, shows a relatively low percenatage of road travel, which may be due to the fact that Slovakia has a well-developed railway infrastructure and an extensive railway network, which makes people more willing to travel by rail.

  2. Overall, we notice that eastern European countries make more use of buses and coaches compared to western European countries. Of course, this does depend on multiple factors.

    • Environment:
      Is the country hilly/mountainous or is it flat? Is there space in cities for buses/coaches or is an underground rail network a better option?
    • Time:
      One can assume it is easier and cheaper on the short term to add buses and coaches in the travel infrastructure than it is to add a rail network. However, on the long term, a rail network could bring in more people due to its efficiency and potential to be controlled remotely.
    • GDP:
      The income of a country also affects what infrastructure can be build. Typically, a rail network is more expensive than creating a bus network. However, the data shows that Italy, which we defined as a low income country, uses pretty much only rail travel when compared to bus travel. And Finland, which we defined as a high income country, uses more road travel. This of course could be because of other factors.
  3. 2013-2021 can be broadly categorised into three periods based on the above trends in each country:

    • 2013-2016, Stable period
      During this period, the percentage of road travel is stable in high-income countries.Low-income countries are also stable,except for Bulgaria,which countinues to grow.
    • 2017-2019, Decline period
      In this period, high-income countries all reach the lowest percentage in 2019, low-income countries are all declining or fluctuating trend.
    • 2020, COVID-19 impact period
      In this period, because of the impact of the COVID-19, almost all countries in this year, the percentage of road travel generally increased, people are more willing to choose private transport or road public transport of low population density travel mode.
    • 2021, COVID-19 recovery period
      During this period, the percentage of road travel in high-income countries is either declining or unchanged, while in most low-income countries it continues to rise or remains unchanged, probably because the COVID-19 in high-income countries was contained relatively faster and people began to return to normal life more quickly.

Research on road safety, with a focus on road expenditure¶

  • This part is focus on the relationship between expenditure on highway investment and maintenance.
  • The result graph can be seen from the connection below

https://til6022-project-l3z3ycqjbpwyyp5vvudwlw.streamlit.app/

Analysis:

  1. Investment in road construction during the current year shows an inverse relationship with the mortality rate in HIC, while it shows a direct relationship in LIC. This is an interesting phenomenon, and verified our hypothesis in RQ. Thus, in addition to the improvements of accessibility and territorial cohesion that road investments bring to HICs, they contribute to reduction of the mortality rate. This benefcial efect on road safety may be due to the great improvement brought about by the construction of new high capacity roads and the dualling of two-lane roads.

  2. Regarding maintenance expenditure, the results for HICs and LICs coincide in terms of the reductive efect—though greater in LIC than in HIC, as well as in the level of signifcance of the variable overall. This indicates that an increase in investment in road maintenance will increase the number of fatalities and injuries. This indirectly reflects the shortcomings of maintenance policies and the lack of consideration for road safety.

Advance Research on road safety¶

  • Due to the complexity of the subject of road safety, which is infuenced by a multitude of aspects, further independent variables could be incorporated to help control for certain factors related to the road network, national alcohol consumption, Economic development, etc.

Correlation Analysis¶

Variables & Factors:

Choose the data of fatalities, injurie s& killed as two important dependent variables. Then choose other data including road investment, maintanace, motorway proportion, motorization rate, alcohol consumption, unemployment rate, GDP per capita as seven factors that could affect the changes of variables.

Method:

These data is likely leading to new statements or locksdowns and the data possibility correlates. The relation will therefore be explored by correlation analysis.

  • This can best be done by a initial exploration with a correlation matrix, giving an overview of all the numerical correlations.
  • First step is to create new dataframes from the dataframe of ralated variables and factors. It could using a dictionary to store them.
  • Next is to unpack dictionaries for better operation of dataframe data structures.
  • The index of dataframe is years, the columns are the exact variable name.
  • Finally, draw the Pearson matrix of the countries and selected two countries as represents of their groups in HIC and LIC.
In [106]:
# Create dictionaries to store each countries of variables and factors 
variable = {}
factor = {}
safety = {}

# Creating new dataframes using a loop
for country in countries:
 
    # Set the data of fatalities, injuries & killed as variables
    var1 = fat_inh[fat_inh['GEO'] == country]
    var2 = inj_kill[inj_kill['GEO'] == country]
    
    # Set new dataframe to save these two variables name as country
    variable_country = pd.DataFrame()
    for i, var in enumerate([var1,var2]):
        # The absence of data from that country in the original data may occur
        if not var.empty:
            selected_variable = var.transpose()
            variable_country = pd.concat([variable_country, selected_variable], axis=1)
        else:
            empty_index = ['GEO'] + [str(year) for year in range(2010, 2020)]
            var = pd.DataFrame(np.nan, index = empty_index, columns =[0])
            variable_country = pd.concat([variable_country,var],axis=1)
            
    # Rename the columns as their original name
    variable_country.columns = ['Road Fatalities',
                                'Injuries & Killed']
    
    # Add the variable values of this country to the dictionary
    variable[country] = variable_country
    
    # To distinguish and abbreviate, renaming dataframe into factors 1-7
    for idx,df in enumerate([road_inv, road_maint, df_mot_rate, df_mot_index, df_alcohol, df_unempl_rate, GDP],start=1):
        new_df = df[df['GEO'] == country]
        globals()[f'fac{idx}'] = new_df
    
    # Set new dataframe to save these two variables name as country
    factor_country = pd.DataFrame()
    for i, fac in enumerate([fac1,fac2,fac3,fac4,fac5,fac6,fac7]):
        if not fac.empty: 
            selected_factor = fac.transpose()
            factor_country = pd.concat([factor_country, selected_factor], axis=1)
        else:
            empty_index = ['GEO'] + [str(year) for year in range(2010, 2020)]
            fac = pd.DataFrame(np.nan, index = empty_index, columns = [0])
            factor_country = pd.concat([factor_country, fac], axis=1)
            
    # Rename the columns as their original name
    factor_country.columns = ['Road Investment',
                              'Road Maintainance',
                              'Motorway proportion',
                              'Motorization',
                              'Alcohol Consumption',
                              'Unemployment rate',
                              'GDP per capita']
    
    # Add the factors values of this country to the dictionary
    factor[country] = factor_country
    
    # Merge the dataframes of variable and factor into df safety and delete unnecessary lines
    safety_country = pd.concat([variable_country,factor_country],axis=1)
    safety_country = safety_country.iloc[1:]
    safety[country] = safety_country.replace(',', '', regex=True).astype(float)
    safety[country] = safety_country.dropna(axis=1)
In [40]:
# Unpack the dictionary
# Use a loop to read each DataFrame one by one in the dictionary 
df_safety = []  
for country, df in safety.items():
    # Set name of each countries' dataframe
    globals()[f"df_safety_{country}"] = df  
    df_safety.append(df)
Out[40]:
Road Fatalities Injuries & Killed Road Investment Motorization Alcohol Consumption Unemployment rate GDP per capita
2010 4.5 374,818 12,250,000,000 527.0 11.58 7.0 41,572.50
2011 5 396,374 12,290,000,000 534.0 11.33 5.8 46,705.90
2012 4.5 387,978 11,120,000,000 539.0 11.32 5.4 43,855.90
2013 4.1 377,481 11,390,000,000 543.0 11.08 5.2 46,298.90
2014 4.2 392,912 11,810,000,000 547.0 11.07 5.0 48,023.90
2015 4.2 396,891 11,430,000,000 548.0 11.06 4.6 41,103.30
2016 3.9 399,872 12,090,000,000 555.0 10.96 4.1 42,136.10
2017 3.8 393,492 13,520,000,000 561.0 10.86 3.8 44,652.60
2018 4 399,293 15,810,000,000 567.0 10.88 3.4 47,939.30
2019 3.7 387,276 16,750,000,000 574.0 10.56 3.1 46,793.70
In [108]:
# Using Poland as a representative of low-income countries
df_Finland = safety['Finland']

# Compute the Pearson's correlation coefficient
df_Finland = df_Finland.replace(',', '',regex=True).astype(float)
df_Finland_hm = df_Finland.corr(method = 'pearson')

# Make a heatmap to display the results
sns.heatmap(df_Finland_hm, annot = True)
plt.title('Pearson\'s correlation coefficient - Finland')
Out[108]:
Text(0.5, 1.0, "Pearson's correlation coefficient - Finland")
In [114]:
# Using Poland as a representative of low-income countries
df_Poland = safety['Poland']

# Compute the Pearson's correlation coefficient
df_Poland = df_Poland.replace(',', '',regex=True).astype(float)
df_Poland_hm = df_Poland.corr(method = 'pearson')

# Make a heatmap to display the results
sns.heatmap(df_Poland_hm, annot = True)
plt.title('Pearson\'s correlation coefficient - Poland')
Out[114]:
Text(0.5, 1.0, "Pearson's correlation coefficient - Poland")

From the heatmap above, several strong correlations can be seen. And the different groups High income(HIC) and low income(LIC) presents different patterns. In the group of HIC, represented by Finland. In the group of LIC, represented by Poland.

  1. The motorway proportion has less correlations with safety in HIC. But the LIC shows strong instead. This might because the construction of motorway has stabilized over these years in HIC. But with different situation, the countries in LIC is still in the period of increasing construction, and the annual construction mileage shows more higher, it is likely lead to high mortality rate.

  2. The coeficient between road fatalities, injuries&killed and the motorization are -0.81 and -0.98 respectively, which is very high. It can be inferred that when the motorization rate goes higher, the number of casualties will decrease. That is, they are anti correlated. LIC shows less correlations, this might because of the proposal of environmental law or some standard for motor vehicle ownership. The increase in residents' motor vehicles is decreasing year by year and tends to stabilize.

  3. In HIC, the correlationship between alcohol consumption and injuries is up to 0.95, the most significant among these 7 factors. And so as the LIC, they are all positively correlated. Although the data were available in the World Bank database, interpretations derived from these data would be questionable, as they refer to the total population (over 15 years) and do not necessarily represent the attitudes of drivers regarding alcohol consumption.

  4. While in the aspect of unemployment rate, the HIC did not result significant.But in LIC, the correlation is stronger. Just as we assumed: although the unemployment rate may be related to road safety, it is only a potential factor, and the actual impact may be mediated and influenced by other factors.

  5. GDP also has correlation with road safety in LIC, but shows weak in HIC. The beneficial effect of GDP on road safety may be due to the impact that greater GDP has improvement of the health care system and on the existence of betterpassive protection measures both in vehicles and infrastructure itself, such effects proving valid at least in LIC.

Part IV - Streamlit¶

This section shows the data used in the Streamlit application and contains a link to a working version.

  • Due to parallel working, the Streamlit application is a standalone mini project.
  • It answers a sub-question of the research question, namely: Is there a correlation between road safety (number of injuries & deaths) and road infrastructure expenditures? (We can assume that road infrastructure expenditures are linked to the GDP of a country.)
  • Hypothesis: More money spend on road infrastructure should reduce the number of injuries & deaths.
  • To get as much data as possible, its focus is not only on Europe, and has a larger time period of 1995-2021.
  • The entire story line is contained within the Streamlit application, therefore, we only show the data processing here.

The Streamlit application can be found here. The files on GitHub can be found under the Streamlit folder.

Data Used¶

All the data used for the Streamlit application was obtained on OECD Stat. Each dataset retrieved is for the time period 1995-2021 which was the largest time period possible for all the datasets.

The datasets contain the following information: Country, year, population, gdp and government spending in millions of local currencies, road investment and maintenance costs in local currency, the amount of road passenger km in millions and the amount of road injuries & deaths combined.

Data Pipeline¶

  • Obtain set of countries which appear in all datasets
  • Create intersection of all these country sets
  • Filter out all countries in the datasets not in the intersection set
  • Sort datasets on country alphabetically and year chronologically and reset indices such that data can be combined
  • Create new dataset for road safety
  • Create new dataset for road expenditures
  • Analyze and visualize the two created datasets using the Streamlit application

In order to have all this data on the same countries, we first find the set of countries which every dataset contains. To find the countries in a dataset, we use the following function:

In [2]:
# Creates a set of all available countries in a file
def check_available_countries(file):
    df = pd.read_csv(file)
    country_set = set()
    
    for country in df['Country']:
        country_set.add(country)
    
    return country_set

We then use this function to get the sets of countries for each dataset. This allows us to find the intersection between the sets and thus the available countries for all datasets.

In [3]:
# Get the available countries for each file
countries_road_fat = check_available_countries('streamlit/data/road_injuries_deaths.csv')
countries_road_inv = check_available_countries('streamlit/data/road_investment.csv')
countries_road_maint = check_available_countries('streamlit/data/road_maintenance.csv')
countries_road_pg = check_available_countries('streamlit/data/road_passengers.csv')
countries_pop = check_available_countries('streamlit/data/population.csv')
countries_gdp = check_available_countries('streamlit/data/gdp_government_spending.csv')

# Find the available countries in all files (the intersection)
available_countries = set.intersection(countries_road_fat, countries_road_inv, countries_road_maint, countries_road_pg, countries_pop, countries_gdp)

Now that we have the available countries, we remove any countries not in this set from all dataframes using the below function.

In [4]:
# This function removes countries not in the available countries set
def filter_available_countries(file):
    df = pd.read_csv(file)
    
    # Rename the time column to year as it contains years
    df.rename(columns={'Time': 'Year'}, inplace=True)
    df = df[df['Country'].isin(available_countries)]
    
    df = df['Year'].drop_duplicates().to_frame().merge(df['Country'].drop_duplicates(), how="cross").merge(df, how="left")
    
    # Store filtered file with the same name prepended with fil_
    df.to_csv(file.split('/')[0] + '/fil_' + file.split('/')[1], index=False)
In [ ]:
# Filter the files to only contain the available countries
filter_available_countries('streamlit/data/road_injuries_deaths.csv')
filter_available_countries('streamlit/data/road_investment.csv')
filter_available_countries('streamlit/data/road_maintenance.csv')
filter_available_countries('streamlit/data/road_passengers.csv')
filter_available_countries('streamlit/data/population.csv')
filter_available_countries('streamlit/data/gdp_government_spending.csv')

We now have datasets all containing the same countries and time period (were obtained using the same time period).

Next, we create required data for the analysis in the Streamlit application using the filtered datasets. In order to combine columns, the data needs to line up. Therefore, each dataframe is sorted on countries and years, and the index is reset.

In [ ]:
# Sorts dataframes based upon country names alphabetically and years chronologically
def sort_countries(df):
    return df.sort_values(by=['Country', 'Year'])

Road Safety data¶

To be able to compare countries, we look at injuries and passenger kilometres relative to a countries population. In the code below, this data is combined and stored.

In [ ]:
# Read in the required data
df_pop = pd.read_csv('streamlit/data/fil_population.csv')
df_inj = pd.read_csv('streamlit/data/fil_road_injuries_deaths.csv')
df_pas = pd.read_csv('streamlit/data/fil_road_passengers.csv')


# Sort the countries and years for the dataframes and reset the indices such that they are the same across dataframes
df_pop = sort_countries(df_pop)
df_pop.reset_index(drop=True, inplace=True)

df_inj = sort_countries(df_inj)
df_inj.reset_index(drop=True, inplace=True)

df_pas = sort_countries(df_pas)
df_pas.reset_index(drop=True, inplace=True)

# Create a new dataframe of the data that we require for the Streamlit application
df = pd.DataFrame(data={'Year': df_inj['Year'],
                        'Country': df_inj['Country'],
                        'Location': df_inj['COUNTRY'],
                        'Injuries': df_inj['Value'],
                        'Population': df_pop['Value'],
                        'Passenger_kilometres': df_pas['Value'],
                        'Injuries_passenger_kilometres': df_inj['Value'] / df_pas['Value'] * 1000,
                        'Percentage_inj_pop': df_inj['Value'] / df_pop['Population'] * 100,
                        'Percentage_inj_pk_pop': (df_inj['Value'] / df_pas['Value'] * 1000) / df_pop['Value'] * 100
                        })

# Save the created dataframe for use in the Streamlit application
df.to_csv('streamlit/data/app_data/road_safety.csv', index=False)

Road Expenditures data¶

To be able to compare countries, we look at road expenditures as percentage of the total expenditures as we only have expenditures in local currencies. In the code below, this data is combined and stored.

In [ ]:
# Read in the required data
df_gdp_gs = pd.read_csv('streamlit/data/fil_gdp_government_spending.csv')
df_rm = pd.read_csv('streamlit/data/fil_road_maintenance.csv')
df_ri = pd.read_csv('streamlit/data/fil_road_investment.csv')

# Split the combined GDP and government spending dataframe into two
df_gdp = df_gdp_gs[df_gdp_gs['STATISTICS'] == 'GDP']
df_gs = df_gdp_gs[df_gdp_gs['STATISTICS'] == 'T_PUB_EXP'].copy()

# Fill in missing data for Turkey in the government spending dataframe
for i in range(1995, 2000, 1):
    df_gs.loc[-i] = [i, 'Türkiye', 'TUR', 'T_PUB_EXP', 'Total Government Expenditure (mln)', 'LOCAL_CUR', 'Local currency', i, np.NaN, np.NaN, np.NaN]

# Sort the countries and years for the dataframes and reset the indices such that they are the same across dataframes
df_gdp = sort_countries(df_gdp)
df_gdp.reset_index(drop=True, inplace=True)

df_rm = sort_countries(df_rm)
df_rm.reset_index(drop=True, inplace=True)

df_ri = sort_countries(df_ri)
df_ri.reset_index(drop=True, inplace=True)

df_gs = sort_countries(df_gs)
df_gs.reset_index(drop=True, inplace=True)

# Create a new dataframe of the data that we require for the Streamlit application
df = pd.DataFrame(data={'Year': df_gdp['Year'],
                        'Country': df_gdp['Country'],
                        'Location': df_gdp['LOCATION'],
                        'GDP': df_gdp['Value']*1000000,          # GDP is in millions
                        'Gov_Spending': df_gs['Value']*1000000,  # Government spending is in millions
                        'Maintenance': df_rm['Value'],
                        'Investments': df_ri['Value']
                        })

# Create percentage of total government spending columns for road maintenance and investments
df['Perc_Maintenance'] = (df['Maintenance']/df['Gov_Spending'])*100
df['Perc_Investments'] = (df['Investments']/df['Gov_Spending'])*100

# Save the created dataframe for use in the Streamlit application
df.to_csv('streamlit/data/app_data/road_expenditures.csv', index=False)

Here we show the two created datasets: (They are also available on the Streamlit application on the Data page)

In [5]:
df_saf = pd.read_csv('streamlit/data/app_data/road_safety.csv')
df_saf
Out[5]:
Year Country Location Injuries Population Passenger_kilometres Injuries_passenger_kilometres Percentage_inj_pop Percentage_inj_pk_pop
0 1995 Belgium BEL 73568.0 10136814.0 104228.0 705.837203 0.725751 0.006963
1 1996 Belgium BEL 68259.0 10156641.0 104096.0 655.731248 0.672063 0.006456
2 1997 Belgium BEL 70907.0 10181246.0 107343.0 660.564732 0.696447 0.006488
3 1998 Belgium BEL 72260.0 10203012.0 112960.0 639.695467 0.708222 0.006270
4 1999 Belgium BEL 72543.0 10226419.0 116460.0 622.900567 0.709369 0.006091
... ... ... ... ... ... ... ... ... ...
697 2017 United States USA 2783000.0 325122128.0 6558301.0 424.347708 0.855986 0.000131
698 2018 United States USA 2747000.0 326838199.0 NaN NaN 0.840477 NaN
699 2019 United States NaN NaN 328329953.0 NaN NaN NaN NaN
700 2020 United States NaN NaN 331511512.0 NaN NaN NaN NaN
701 2021 United States NaN NaN 332031554.0 NaN NaN NaN NaN

702 rows × 9 columns

In [6]:
df_exp = pd.read_csv('streamlit/data/app_data/road_expenditures.csv')
df_exp
Out[6]:
Year Country Location GDP Gov_Spending Maintenance Investments Perc_Maintenance Perc_Investments
0 1995 Belgium BEL 2.104891e+11 1.108116e+11 5.800000e+07 1.600000e+08 0.052341 0.144389
1 1996 Belgium BEL 2.142871e+11 1.136839e+11 6.300000e+07 1.590000e+08 0.055417 0.139861
2 1997 Belgium BEL 2.241015e+11 1.156402e+11 6.000000e+07 1.530000e+08 0.051885 0.132307
3 1998 Belgium BEL 2.326238e+11 1.185435e+11 5.200000e+07 1.520000e+08 0.043866 0.128223
4 1999 Belgium BEL 2.423076e+11 1.223574e+11 5.800000e+07 1.610000e+08 0.047402 0.131582
... ... ... ... ... ... ... ... ... ...
697 2017 United States USA 1.947734e+13 7.412759e+12 5.144224e+10 9.660000e+10 0.693969 1.303159
698 2018 United States USA 2.053306e+13 7.794634e+12 6.063702e+10 1.023000e+11 0.777933 1.312441
699 2019 United States USA 2.138098e+13 8.221634e+12 6.150300e+10 1.077000e+11 0.748063 1.309959
700 2020 United States USA 2.106047e+13 9.962874e+12 6.134600e+10 1.063000e+11 0.615746 1.066961
701 2021 United States USA 2.331508e+13 1.047725e+13 NaN 1.086000e+11 NaN 1.036532

702 rows × 9 columns

Conclusion sub-question¶

The Streamlit application states on the correlation page that for some countries there is a correlation between money spend on road infrastructure and the amount of road injuries & deaths. However, it also shows that when looking at the road expenditures relative to the total expenditures, this correlation does not hold. We have to conclude that our hypothesis has been proven wrong.

The correlation that seems to hold for most countries is that for more recent years, the amount of injuries & deaths relative to road passenger kilometres declines. This could be the result of innovation in car technology, like automatic brakes, or changes to road infrastructure that have been a source of many accidents.

Conclusion¶

  • Road fatalities and injuries provide information on different aspects of road traffic accidents. Long term monitoring of trends in these two indicators can help evaluate the effectiveness of road safety improvements and whether measures taken can reduce accidents and mortality rates.
  • Through the research, we can find that in the case of some levels in the various countries, insufficient data were encountered for the entire time series. For example, the motorization rate, as we all know, it shows an increasing trend in the early 21st century, but But specific data for some countries were missing. This need further scientific verification.
  • Certainly that there are other relevant indicators can help to develop and evaluate road safety policies and plans.The variables and factors we use are just the tip of the iceberg. Looking forward to mastering a large amount of data for further qualitative analysis in the future.